Lineage Tracing in a Data Warehousing System

Authors

  • Yingwei Cui
  • Jennifer Widom
Abstract

A data warehousing system collects data from multiple distributed sources and stores the integrated information as materialized views in a local data warehouse. Users then perform data analysis and mining on the warehouse views. Figure 1 shows the basic architecture of a data warehousing system. In many cases, the warehouse view contents alone are not sufficient for in-depth analysis. It is often useful to be able to "drill through" from interesting (or potentially erroneous) view data to the original source data that derived the view data. For a given view data item, identifying the exact set of base data items that produced the view data item is termed the view data lineage problem. Motivation for and applications of lineage tracing in a warehousing environment are provided in [2]. In the context of the WHIPS data warehousing project at Stanford [3], we have developed a complete prototype that performs efficient and consistent lineage tracing. Some commercial data warehousing systems support schema-level lineage tracing, or provide specialized drill-down and/or drill-through facilities for multi-dimensional warehouse views. Our lineage tracing prototype supports more fine-grained instance-level lineage tracing for arbitrarily complex relational views, including aggregation. Our prototype automatically generates lineage tracing procedures and supporting auxiliary views at view definition time. At lineage tracing time, the system applies the tracing procedures to the source tables and/or auxiliary views to obtain the lineage results and show the specific view data derivation process.
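To make the idea of instance-level lineage concrete, the following is a minimal Python sketch for a single aggregate view. It is an illustration only and not the WHIPS prototype's algorithm: the table SALES, its columns, and the functions build_view and trace_lineage are hypothetical names chosen for this example, and the tracing procedure is written by hand here, whereas the abstract describes a system that generates such procedures automatically at view definition time.

```python
from collections import defaultdict

# Hypothetical source table with rows of (store, item, amount).
SALES = [
    ("north", "pen", 3),
    ("north", "ink", 5),
    ("south", "pen", 2),
]


def build_view(sales):
    """Materialize an aggregate view: total amount per store."""
    totals = defaultdict(int)
    for store, _item, amount in sales:
        totals[store] += amount
    return dict(totals)


def trace_lineage(view_key, sales):
    """Return the source rows that contributed to one view row.

    For a group-by aggregate, these are the source rows whose
    grouping value equals the key of the view row being traced.
    """
    return [row for row in sales if row[0] == view_key]


view = build_view(SALES)
lineage = trace_lineage("north", SALES)
```

Here the lineage of the view row for "north" is the two source rows for that store, and re-aggregating only those rows reproduces the traced view value. Views that combine joins, selections, and several aggregation levels need more elaborate procedures than this single filter.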

Similar resources

Lineage Tracing in a Data Warehousing System Demonstration Proposal

A data warehousing system collects data from multiple distributed sources and stores the integrated information as materialized views in a local data warehouse. Users then perform data analysis and mining on the warehouse views. Figure shows the basic architecture of a data warehousing system. In many cases, the warehouse view contents alone are not sufficient for in-depth analysis. It is often usefu...

Practical Lineage Tracing in Data Warehouses

We consider the view data lineage problem in a warehousing environment: for a given data item in a materialized warehouse view, we want to identify the set of source data items that produced the view item. We formalize the problem and we present a lineage tracing algorithm for relational views with aggregation. Based on our tracing algorithm, we propose a number of schemes for storing auxiliary view...

Using Schema Transformation Pathways for Data Lineage Tracing

With the increasing amount and diversity of information available on the Internet, there has been a huge growth in information systems that need to integrate data from distributed, heterogeneous data sources. Tracing the lineage of the integrated data is one of the problems being addressed in data warehousing research. This paper presents a data lineage tracing approach based on schema transfor...

Cost Effective Forward Tracing Data Lineage

Data lineage plays a critical role in verifying data correctness in scientific databases and data warehousing. In this paper, we clearly define forward data lineage in bag semantics and show its properties. We propose a tracing method that piggybacks normal query evaluation. Our method effectively supports aggregation and variable granularity lineage. More importantly, it features cost effectiv...

Investigating a heterogeneous data integration approach for data warehousing

Data warehouses integrate data from remote, heterogeneous, autonomous data sources into a materialised central database. The heterogeneity of these data sources has two aspects, data expressed in different data models, called model heterogeneity, and data expressed within different schemas of the same data model, called schema heterogeneity. AutoMed is an approach to heterogeneous data transfor...

Publication date: 2000